Tesseract OCR

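Tesseract is an open-source OCR engine, originally developed at HP and later sponsored by Google. It runs on multiple platforms, ships with pretrained data that works out of the box, and can also be trained on your own data.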

Installation

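After installing, add the Tesseract install directory to the PATH environment variable so that images can be recognized from the command line.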

Usage

Confirm that tesseract can be invoked from the command line:

tesseract
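
The same check can be scripted. The following is a minimal sketch using only the Python standard library; the helper name `tesseract_version` is made up for this example, and it assumes that `tesseract --version` prints a version banner.

```python
import shutil
import subprocess


def tesseract_version(command="tesseract"):
    """Return the first line of `<command> --version`, or None if it is not on PATH."""
    path = shutil.which(command)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Some builds print the banner to stderr instead of stdout
    banner = (result.stdout or result.stderr).strip()
    return banner.splitlines()[0] if banner else ""


if __name__ == "__main__":
    version = tesseract_version()
    print(version if version is not None else "tesseract not found; check PATH")
```

If this reports that tesseract was not found, the install directory is probably missing from PATH.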

Python Module pytesseract

A Python module that calls the locally installed Tesseract OCR directly.

Installation

pip install pytesseract
pip install pillow  # also required, for reading and writing image files

Usage

from PIL import Image
import pytesseract

i = Image.open(filename)  # filename: path to the image to recognize
print(pytesseract.image_to_string(i))
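
For short, single-line text such as a captcha, Tesseract options can be passed through the `config` argument of `pytesseract.image_to_string`. The sketch below only builds that option string, so it runs without Tesseract installed. `build_config` is a hypothetical helper, and whether `--psm 7` (treat the image as a single text line) or a character whitelist actually improves results on a given image is an assumption to verify.

```python
import string


def build_config(psm=7, whitelist=string.ascii_letters + string.digits):
    """Build a Tesseract option string for pytesseract's `config` argument."""
    parts = ["--psm %d" % psm]  # page segmentation mode; 7 = single text line
    if whitelist:
        # Restrict recognition to the listed characters
        parts.append("-c tessedit_char_whitelist=%s" % whitelist)
    return " ".join(parts)


if __name__ == "__main__":
    print(build_config(whitelist=string.digits))
```

The result would be used as `pytesseract.image_to_string(i, config=build_config())`. Tesseract 3.x spells the flag `-psm` rather than `--psm`, so adjust it for older installs.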

Selenium

Selenium drives the browser to fill in the account details, and OCR is used to fill in the captcha.

Launch

機電在線 registration page

The captcha on the 機電在線 registration page is used to test how well tesseract recognizes text.

First, a few helper functions

Cropping a screenshot to an element

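# Take a full-page screenshot, then crop it to the given element.
# `browser` is the Selenium WebDriver instance created earlier.
def crop_screenshot(fullfile, cropfile, element):
    browser.save_screenshot(fullfile)
    if not element:
        return

    attr = element["type"]  # attribute to match on
    name = element["name"]  # expected value of that attribute

    # find_element("xpath", ...) works on both Selenium 3 and Selenium 4
    imgelement = browser.find_element("xpath", ".//*[@%s=%r]" % (attr, name))
    location = imgelement.location
    size = imgelement.size
    # (left, upper, right, lower) box in screenshot coordinates
    box = (int(location['x']), int(location['y']),
           int(location['x'] + size['width']),
           int(location['y'] + size['height']))

    i = Image.open(fullfile)
    fincrop = i.crop(box)
    fincrop.save(cropfile)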

After locating the element's coordinates, take a full screenshot and crop out just the captcha image.
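
The crop box is plain arithmetic on the element's `location` and `size` dictionaries. A minimal sketch follows. `crop_box` and its `scale` parameter are hypothetical; `scale` is added on the assumption that on a high-DPI display the screenshot may contain more pixels than the page's CSS coordinates suggest.

```python
def crop_box(location, size, scale=1.0):
    """Return the (left, upper, right, lower) box for PIL's Image.crop."""
    left = location["x"] * scale
    upper = location["y"] * scale
    right = (location["x"] + size["width"]) * scale
    lower = (location["y"] + size["height"]) * scale
    return (int(left), int(upper), int(right), int(lower))


if __name__ == "__main__":
    # Element at (100, 240) that is 60 px wide and 20 px tall
    print(crop_box({"x": 100, "y": 240}, {"width": 60, "height": 20}))
    # → (100, 240, 160, 260)
```

If the cropped image comes out offset or too small, a scale factor other than 1.0 is one thing worth checking.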

Recognizing the image text with OCR

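def ImgToOcr(filename):
    img = Image.open(filename)

    # Resize the image; Image.LANCZOS replaces Image.ANTIALIAS, removed in Pillow 10
    img = img.resize((90, 32), Image.LANCZOS)

    Lim = img.convert('L')  # grayscale

    # Denoise by binarizing: values below the threshold become 0, the rest 1
    threshold = 80
    table = [0 if value < threshold else 1 for value in range(256)]
    Lim = Lim.point(table, '1')

    # OCR; strip the surrounding whitespace pytesseract may return
    OCR_text = pytesseract.image_to_string(Lim).strip()

    # Save four-character results under the recognized text
    # (the ./ocr directory must already exist)
    if len(OCR_text) == 4:
        Lim.save("./ocr/%s.png" % OCR_text)

    return OCR_text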

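The original image is too small to give good recognition results, so it is first resized to (90, 32), then converted to grayscale and binarized, and saved with the recognized text as its file name. Finally, 20 images are run through the function to see how well it works.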
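
The binarization step can be isolated as a lookup table: one entry per possible grayscale value (0 to 255), mapping values below the threshold to 0 (black) and the rest to 1 (white). The sketch below uses hypothetical helper names, `threshold_table` and `binarize`, and applies the table to a few sample values in plain Python so that it runs without Pillow.

```python
def threshold_table(threshold=80):
    """Lookup table for Image.point(table, '1'): 0 below the threshold, 1 otherwise."""
    return [0 if value < threshold else 1 for value in range(256)]


def binarize(pixels, threshold=80):
    """Apply the table to a sequence of grayscale values (0-255)."""
    table = threshold_table(threshold)
    return [table[p] for p in pixels]


if __name__ == "__main__":
    # Dark text pixels (below 80) map to 0; lighter background and noise map to 1
    print(binarize([0, 79, 80, 200, 255]))
    # → [0, 0, 1, 1, 1]
```

Raising the threshold turns more mid-gray pixels black, while lowering it clears more light noise at the risk of thinning the characters.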

Testing shows that the recognition rate is quite good; only a few images failed.
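
To put a number on the recognition rate, the results can be tallied against hand-labeled answers. This is a minimal sketch: the sample pairs are made-up placeholders, not the results of the test above, and `normalize` and `accuracy` are hypothetical helpers.

```python
def normalize(text):
    """Drop whitespace and other non-alphanumeric characters from OCR output."""
    return "".join(ch for ch in text if ch.isalnum())


def accuracy(pairs):
    """pairs: iterable of (expected, recognized). Returns the fraction matched."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for expected, got in pairs if normalize(got) == expected)
    return hits / len(pairs)


if __name__ == "__main__":
    # Placeholder data for illustration only
    samples = [("A1B2", "A1B2\n"), ("9XK4", "9 XK4"), ("7Q2M", "7O2M")]
    print("%.2f" % accuracy(samples))
    # → 0.67
```

Normalizing before comparing keeps stray spaces or trailing newlines in the OCR output from counting as failures.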

Summary

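The default language data is quite powerful: as long as the characters are reasonably clear and free of noise, they are recognized accurately. For captchas that do not go this smoothly, you can train your own data, though the steps are rather tedious. If the goal is only to log in, using a cookie to log in and fetch the data is comparatively easier.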
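
As a sketch of the cookie approach: after logging in once by hand, the session cookie can be copied from the browser and sent with later requests. The example uses only the standard library. The URL, cookie name, and cookie value are placeholders, and `request_with_cookies` is a hypothetical helper; it builds the request without sending it.

```python
import urllib.request


def request_with_cookies(url, cookies):
    """Build a urllib Request that carries the given cookies in its Cookie header."""
    header = "; ".join("%s=%s" % (name, value) for name, value in cookies.items())
    return urllib.request.Request(url, headers={"Cookie": header})


if __name__ == "__main__":
    # Placeholder URL and cookie; copy real values from a logged-in browser session
    req = request_with_cookies("https://example.com/profile", {"sessionid": "abc123"})
    print(req.get_header("Cookie"))
    # → sessionid=abc123
```

Passing the request to `urllib.request.urlopen` would then send it with the cookie attached.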

A few sites for practicing captcha recognition

機電在線 registration page
